Statistical approach for optimal use of genetic information collected on historical pedigrees

ABSTRACT

This invention provides a novel means of predicting plant phenotypes that incorporates previously unusable dense marker data derived from historical pedigrees. The method operates by collecting information from a population pertaining to one or more loci, which is used to build one or more matrices by calculating, for the alleles present at the measured loci, the probability that the alleles are identical by descent. These matrices are then used to develop a second set of one or more matrices in which each value represents the probability that a certain individual in the population descended from a certain ancestral (founder) genotype. This set of second matrices can then be used as part of a breeding program for selecting and breeding individuals from the population or can be used to better classify the individuals in the population, leading to improved plant phenotypes.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No.12/572,663, filed Oct. 2, 2009, which claims the benefit of U.S.Provisional Application No. 61/102,144, filed Oct. 2, 2008, the contentsof which are herein incorporated by reference in their entirety.

BACKGROUND

The manipulation of crop genetics for the optimization of agronomictraits has resulted in a revolution in the seed industry. However, asmany as 98% of these agronomic traits are quantitative traits, such thatthey are controlled by two or more genes and have measurable variabilityamong the individual phenotypes. In order to understand and control theinheritance of these genes and the resultant phenotypes, scientists inthe field have traditionally utilized methods such as quantitative traitlocus (QTL) analysis.

As an outcome of QTL analysis, scientists identify chromosomal regionsthat are in close proximity to genes controlling a trait of interest.These chromosomal regions may be the target gene itself or may begenetic markers such as RFLPs, AFLPs, RAPDs, VNTRs, microsatellitepolymorphisms, SNPs, and STRs. Because the markers are in closeproximity to the genes, they tend to be inherited along with the gene (aphenomenon known as genetic linkage). As a result, the marker can beused to track the inheritance of the genes of interest. The process andstatistical methods used to identify the location and effects of thevarious genes of interest or markers associated with these genes iscalled QTL mapping.

A major limitation of QTL analysis is that it has historically beenlimited to populations derived from single crosses of inbred lines andthus has not incorporated the extensive plant pedigree data that hasbeen collected by plant breeders. Previous attempts to explicitlyinclude this information have proven to be computationally intensive andare difficult to parameterize. As a result, vast amounts of datacollected over the course of the development of modern elite plantvarieties is essentially unused for purposes of phenotypic prediction.

Analysis of more complex populations derived from multiple founders orcollected from ongoing breeding programs has the potential tosignificantly improve the understanding of important agronomic traits.For example, if the stability and magnitude of individual genes acrossdifferent genetic backgrounds can be quantified more accurately,improved response to selection can be obtained. The use of these morecomplex populations would provide better context for making inferencesby better approximating commercial breeding populations and wouldincrease cost effectiveness by allowing use of available phenotypes fromongoing selection experiments. While some efforts have been undertakento use historical pedigree data to predict phenotypes within mappingpopulations, those efforts have been largely unproductive in large partbecause of the inability to efficiently use the extensive pedigree andmarker information that must be combined and analyzed.

The important benefits that can be derived from incorporation ofcomprehensive pedigree data coupled with the inability of those skilledin the art to effectively perform such analysis underscores the longfelt, unsatisfied need for such tools in the art.

SUMMARY

This invention provides a novel means of predicting plant phenotypesthat incorporates previously unusable dense marker data derived fromhistorical pedigrees. The method operates by collecting information froma population pertaining to one or more loci. The information is thenused to build one or more matrices by calculating, for the allelespresent at the measured loci, the probability that the alleles areidentical by descent (IBD). These matrices are then used to develop asecond set of one or more matrices in which each value represents theprobability that a certain individual in the population descended from acertain ancestral (founder) genotype. This set of second matrices canthen be used as part of a breeding program for selecting and breedingindividuals from the population or can be used to better classify theindividuals in the population, leading to better identification ofimproved plant phenotypes. The disclosed methods take the full amount ofhistorical pedigree and marker data that is available or desired to beutilized for predictive purposes and places it into a manageable formsuch that it may be used to improve the mapping resolution while stillmaking the analysis of the information practical from a processingperspective. As a result, this method results in better mappingresolution and higher prediction accuracy than has been realisticallyfeasible.

The genetic markers used can be any that are used in traditional QTLmapping analysis such as RFLPs, AFLPs, RAPDs, VNTRs, microsatellitepolymorphisms, SNPs, and STRs. The phenotypic traits include anypolygenic phenotypic traits of interest.

In one embodiment of the invention, the population consists of diploidplants, either hybrid or inbred. In another embodiment, the populationconsists of polyploid plants. Plants relevant to the invention include,for example, maize, wheat, rice, millet, barley, sorghum, rye, soybean,alfalfa, oilseed Brassica, cotton, sunflower, potato, or tomato.

The first matrix (of IBD probabilities) is calculated using standardstatistical techniques, including those based on Markov Chain MonteCarlo algorithms. The genotypes in the second matrix are automaticallygenerated using mathematical models. In a preferred embodiment, thesecond matrix is calculated using a latent (ancestral) class model,although other models such as a threshold model, distance model, avector model, or a cluster analysis model may be used. In a preferredembodiment, the second matrix is used in a Bayesian statistical approachfor more accurate QTL detection and prediction of plant phenotypes withcalculations performed using a Markov Chain Monte Carlo based algorithmwhile keeping computational efforts to a minimum. These calculations mayalso be computer-implemented.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a graphical depiction of chromosomal position of two differentloci in a population of eight individuals.

FIG. 2 is a graphical depiction of regions of similarity and differencefor the genomes of parental lines, as identified by identity by descentinformation obtained using a dense parental marker map, and a depictionof the resulting effective progeny marker map.

FIG. 3 is the result of a QTL analysis on grain yield data collected on976 FS maize topcross progenies at a single location in the UnitedStates Corn Belt.

FIG. 4 is a histogram of the estimated genetic value for all 976 progenyobtained by summing the weighted effects of all QTL positions identifiedon all ten chromosomes.

FIG. 5 is a QTL intensity profile from analysis of data where parentsare considered to be unrelated and when IBD information is used toaugment the analysis.

FIG. 6 is a comparison of predicted maturity date when using traditionalpedigree-only approaches and when augmenting the analysis using IBD.

FIG. 7 is a graph comparing predicted values for new selected F1 hybridsto their observed measurements.

DETAILED DESCRIPTION

The present invention incorporates population pedigree data andassociated genetic marker data in order to better predict groupcharacteristics (e.g. phenotypes) within the population. This isaccomplished, for example, through the use of two matrices. Whilematrices are used herein, it is understood that alternative methods mayalso be used to achieve the same result.

The elements of the first matrix (Q) represent the probability thatpairs of alleles are identical by descent. Alleles are identical bydescent (IBD) if they are identical copies of an allele segregating froma common ancestor within the defined pedigree. In the context of inbredlines, allele and individual are exchangeable. As such, Q is an n×nsquare symmetric matrix of non-negative elements between 0 and 1, wheren represents the number of individuals in the population. The firstmatrix depends on the specified locus and is typically calculated fromthe pedigree data and genetic marker data. The matrix may also containidentity by descent probabilities measured across multiple loci,chromosomal segments, haplotypes, or the whole genome.

The second matrix (P) is an n×K matrix comprised of non-negativeelements that represent the assignment probabilities of individuals toclasses (e.g. ancestral genotypes), wherein K represents a fixed butunknown number of classes and every row of P sums to 1. With thisarrangement, the second matrix represents the first matrix in the sensethat Q is approximated by PP^(T), while disregarding the diagonalelements of Q and PP^(T). Example matrices for Q and P are found inTable 1 below:

TABLE 1 Artificial 6 × 6 Q matrix for six individuals labeled A-F and a6 × 4 matrix P with classes labeled C₁-C₄, giving a perfect fit to theoff-diagonal elements of Q by the formula PP^(T). $Q = \begin{matrix}\; & A & B & C & D & E & F \\A & 1 & 0 & 0 & 0 & 0 & 0 \\B & 0 & 1 & 1 & 1 & 0 & 0 \\C & 0 & 1 & 1 & 1 & 0 & 0 \\D & 0 & 1 & 1 & 1 & 0 & 0 \\E & 0 & 0 & 0 & 0 & 1 & 0.7 \\F & 0 & 0 & 0 & 0 & 0.7 & 1\end{matrix}$ $P = \begin{matrix}\; & C_{1} & C_{2} & C_{3} & C_{4} \\A & 1 & 0 & 0 & 0 \\B & 0 & 1 & 0 & 0 \\C & 0 & 1 & 0 & 0 \\D & 0 & 1 & 0 & 0 \\E & 0 & 0 & 1 & 0 \\F & 0 & 0 & 0.7 & 0.3\end{matrix}$

This approximation can be accomplished through methods such as, but notlimited to, least squares using a number of numerical approaches, suchas a brute force genetic algorithm (differential evolution) or twodifferent iterative row-wise quadratic programming approaches.Alternatively P can be constrained to contain values 0 and 1 only,representing a fixed grouping of individuals.

The utility of the second matrix (P) is manifold (but not limited to):

-   1. The matrix P is much smaller in size than the matrix Q if K<n    which makes it easier to deal with both for human inspection and for    computer representation.-   2. The matrix P gives an explicit probabilistic representation of    inheritance of alleles of individuals from a set of latent    (ancestral) founders. The elements of P have a clear meaning; they    are the identity by descent probabilities of the n individuals at a    specified locus with the K latent ancestors.-   3. Each row of P is associated with a specified individual and    indicates the number of ancestors that effectively contributed to    the genotype of that individual at a specified locus;-   4. The value of K that gives a good approximation to Q indicates the    number of ancestors that actually contribute to the genotype of the    individuals at a specified locus.-   5. In many cases in which a pedigree is available the latent    ancestors can be identified as being the most recent common    ancestors in the pedigree.-   6. The matrix P makes it possible to sample or draw ancestors for    each of the n individuals in such a way that the probability that    individual i and j have a common ancestor is their identity by    descent probability for all i≠j (i=1, . . . , n; j=1, . . . , n).    Each such sample is an explicit possible way of inheritance of the    individuals from the set of latent ancestors.    Utilities 2 and 6 are of foremost importance in regression    approaches with genetic predictors (Malosetti et al, 2006) and in    Bayesian methods (Bink et al. 2008) for QTL analysis that cannot    work with the first matrix or matrices (Q) directly. Utility 2 makes    it possible to calculate genetic predictors (matrix P), the    probability density for a proposed way of inheritance. Utility 6    makes it possible to generate proposals for ways of inheritance.

Developing a Latent Ancestral Class Model for an Identity by DescentMatrix

As previously stated, we begin with an n×n square symmetric matrix (Q)with elements q_(ij) between 0 and 1, denoting the probability that thealleles of a pair of individuals i and j (i=1, . . . , n; j=1, . . . ,n) at a specified locus are identical by descent, that is, inheritedfrom the same ancestral allele. The matrix Q depends on the position ofthe locus on the chromosome, but is fixed when a specified locus isconsidered. The goal is to develop a model that has an explicit,possibly probabilistic, representation for the inheritance of the alleleof each individual from a common set of ancestral founders, but withoutfurther usage of the pedigree and/or marker data. The pedigree and/ormarker data is used to generate the matrix Q. The matrix Q thuscondenses all information about the pedigree and marker data that we canuse in the following. Because there is no pedigree information beyondthe information contained in the matrix Q, the ancestral founders of theintended model are unknown and therefore “latent” as they can only behypothesized. The intended model is the latent ancestral class modeldeveloped here.

Using this approach, we arrive at a model in which the allele of eachindividual is inherited from a single latent ancestor. The transitivityproperty applies to this problem such that if the alleles forindividuals i and j are inherited from the same ancestor, and thealleles for individuals i and k are inherited from the same ancestor,then the alleles for individuals j and k are inherited from the sameancestor.

We extend this inheritance model with probabilities. Let P be an n×Kmatrix with K the number latent ancestors and elements p_(ik) being theprobability that the allele of individual i is inherited from ancestork. Note that

$\begin{matrix}{{{p_{ik} \geq {0\mspace{14mu} {and}\mspace{14mu} {\sum\limits_{k = 1}^{K}p_{ik}}}} = 1}{( {{i = 1},\ldots \mspace{14mu},{n;{k = 1}},\ldots \mspace{14mu},K} ).}} & (1)\end{matrix}$

In this model we do not know whether the allele of individual i isinherited from ancestor k, but only the probability of this inheritance.On assuming independence of inheritance for each pair of individuals,the probability that individuals i and j inherited from the sameancestor is, according to the model,

$\begin{matrix}{{q_{ij}^{*} = {{\sum\limits_{k = 1}^{K}{P( {i \in {{{class}(k)}\bigwedge j} \in {{class}(k)}} )}} = {\sum\limits_{k = 1}^{K}{p_{ik}p_{jk}}}}}{\forall{i \neq {j.}}}} & (2)\end{matrix}$

Mathematically, the {q*_(ij)} are coincidence probabilities induced by alatent class model with memberships probabilities P.

As a result, we wish to find a matrix P such that q*_(ij) is close tothe observed q_(ij), for all i≠j. This can be solved using aleast-squares approximation such that the problem is to minimize theloss function:

$\begin{matrix}{{{f(P)} = {\sum\limits_{i = 1}^{n}{\sum\limits_{j = {l + 1}}^{n}( {q_{ij} - {p_{i}^{T}p_{j}}} )^{2}}}},} & (3)\end{matrix}$

where p_(i) ^(T) denotes the i^(th) row of P, subject to the nKnon-negativity and n equality constraints in Equation (1).The loss can be reported in terms of the root mean squared error (RMSE),defined as:

√{square root over (2ƒ(P)/(n(n−1)))}{square root over (2ƒ(P)/(n(n−1)))}.

If the loss is small, we have thus obtained an explicit inheritancemodel for the alleles of the individuals that accurately approximate theidentity by descent probabilities as given by the matrix Q, which wascalculated from the available pedigree and/or marker information. Theinheritance probabilities of alleles of individuals from latentancestors are given in the matrix P. The key identity to arrive atidentity by descent probabilities is equation (2) which can also bewritten in matrix notation as Q*=PP^(T) while disregarding the diagonalelements of Q*. Here Q* is the approximation of Q. Note that equation(2) also holds true if the elements of P are only 0 or 1, and thus Prepresents a division of the individuals into disjoint groups. In thiscase, P can in principle be derived from any method of cluster analysisapplied to Q; otherwise any form of fuzzy clustering can be applied.Here we develop specialized methods to obtain the best P to approximateQ by PP^(T).

The loss function set forth in equation (3) is not convex which raisesthe possibility of multiple local minima, even beyond local minimagenerated by rearrangement of the columns of P. As a result, a globaloptimization method is required. A differential evolution approach andtwo versions of iterative row-wise quadratic programming approaches willserve as three examples of such methods.

Global Optimization Using a Differential Evolution Approach

Differential Evolution (DE) is a derivative-free global optimizationmethod that belongs to the family of evolutionary algorithms. In DE apopulation of solution vectors x₁ . . . x_(N) is used to explore thesolution space and is maintained for a number of generations until thepopulation reaches the minimum. In our example, each member vectorx_(i)=vec(P_(i)) with P_(i) a trial solution of (3) subject to (1). Thesize of the population depends on the problem and the other parametersof the differential evolution, but in general it should be higher thanthe number of parameters d. For this example N=2d=2nK is used.

Each member's vector x_(i)(i=1 . . . N) can be initialized independentlywith nK random values between 0 and 1. The K values of each of the nindividuals can be divided by their sum so as to satisfy constraints(1). The exploration of the space is carried as follows in DE.

A new “mutant” vector is proposed for each member vector x_(i) by usingthree different randomly selected vectors of the population:

x* _(i) =x _(r) ₀ +F(x _(r) ₁ −x _(r) ₂ )  (4)

where F is a scalar that determines step size.

In addition to (4), another generic operation is used in DE, namelycrossover. With a crossover rate CR (0<CR≦1), a fraction CR of theelements in x_(i) are mutated according to (4), whereas the remainingparameters are left unchanged, i.e. are set equal to the correspondingvalues in x_(i).

In this algorithm, the crossover operation is applied on the level ofthe individuals within the parameter vector, i, e, the parameters of anindividual are either mutated according to (4) or left unchanged. Interms of P_(i), each row of P_(i) is thus independently selected formutation, the selection probability being CR.

To ensure that the resulting mutant vector satisfies the constraints of(1), any negative value in x*_(i) is replaced with 0, any value greaterthan 1 is replaced with 1, and the resulting K elements of each of the nindividuals in x*_(i) are divided by their sum.

After the aforementioned adjustments, the member vector x_(i) isreplaced by x*_(i) if ƒ(x*_(i))≦ƒ(x_(i)) where ƒ(.) is the lossfunction. Each possible update of x_(i) requires one function evaluationat the time that loss function values of members are stored.

Global Optimization Using Iterative Row-Wise Quadratic Programming

Another method to minimize the loss function ƒ(P) is to use iterativerow-wise quadratic programming. When using quadratic programming, ƒ(P)is minimized over p_(i) while keeping the other rows of P fixed, whichcan be achieved by minimizing:

∥q _(i) −P _(−i) p _(i)∥²  (5)

where q_(i) is the i^(th) column of Q without q_(ii), and P_(−i) denotesmatrix P after deleting row i subject to the constraints p_(i)≧0 andp_(i) ^(T) 1=1 where 0 and 1 denote vectors of appropriate lengths withall zero and unit elements respectively. Problem (5) can be recognizedas one of quadratic programming. We developed two specialized approachesfor this problem, called Approach A and Approach B.With an algorithm to fit a single row, the complete algorithm forfitting P to Q is:Complete algorithm:(1) Initialize P, e.g. with each row filled with random uniform numbersbetween 0 and 1, which are then divided by their sum, so as to satisfythe constraints (1).(2) While ƒ(P) decreases:

For i=1, N minimize ƒ(P) over the ith row p_(i), while keeping the otherrows of P fixed, by, for example, Approach A or Approach B.

We now provide details of Approach A and Approach B.

Approach a for Global Optimization Using Iterative Row-Wise QuadraticProgramming

The constraint p_(i) ^(T) 1=1 can be enforced by reparamaterization toyield p_(i)=Uv+α1 with U a column wise orthonormal basis for theorthocomplement of 1, and v and α a vector and scalar respectively.Because U and 1 jointly span the whole space, the vector v and scalar αalways exist, Now the constraint that p_(i) ^(T) 1=1 implies thatvU^(T)1+α1^(T)1=1, and because U^(T)1=0 and 1^(T)1=K, it follows thatα=K⁻¹. Hence p_(i)=Uv+K⁻¹1 which satisfies the constraint p_(i) ^(T)1=1for every v.

Based on reparameterizing p_(i) as above, it remains to minimize thefunction:

g(v)=∥q _(i) −P _(−i) K ⁻¹1−P _(−i) Uv∥ ²

over v subject to the constraint that Uv+K⁻¹1≧0. This in turn can besolved by minimizing h(v)=∥f−Ev∥² subject to Gv≧h, where:

f=q _(i) −K ⁻¹ P _(−i)1

E=P _(−l) U

G=U

h=−K ⁻¹.

Once the optimal v is found (see Lawson and Hanson 1995), the vectorp_(i) that minimizes ƒ(P) over p, is given by Uv+K⁻¹1.

Approach B for Global Optimization Using Iterative Row-Wise QuadraticProgramming

As each row of P sums to unity, a column of P can be deleted as we shownow. We delete the last column, i.e. column K. With the (K−1)-vectorb=(p_(i1), p_(i2), . . . , p_(i(K-1)))^(T), we can write

p _(i) =[b ^(T),1−b ^(T)1_((K-1))]^(T) =Cb+d  (6)

with K-vector d=(0 . . . 0,1)^(T), with the “1” in position K, andK×(K−1) matrix

${C = \begin{bmatrix}I_{K - 1} \\{- 1_{K - 1}}\end{bmatrix}},$

where I_(K-1) is a (K−1)×(K−1) identity matrix and 1_(K-1) is a(K−1)-vector of ones Then by inserting (6) in (5) for both p_(i) andeach row of P_(−i) and by defining the (N−1)×(K−1) matrix X withelements x_(jk)=p_(jk)−p_(jK) and the N−1 vector y with elementsy_(j)=q_(ij)−p_(jK) for j=1, . . . i−1, i+1, N and k=1 . . . (K−1), wearrive at the following equivalent problem: find b so as to

$\begin{matrix}{{{minimize}\mspace{14mu} {{y - {Xb}}}^{2}\mspace{14mu} {subject}\mspace{14mu} {to}\mspace{14mu} b_{k}} \geq {0\mspace{14mu} {and}\mspace{14mu} {\sum\limits_{k = 1}^{K - 1}b_{k}}} \leq 1.} & (7)\end{matrix}$

After having found the solution to problem (7), we obtain the solutionto problem (5) by back-transformation of (6), namely p_(ik)=b_(k) fork=1, . . . , K−1 and

$p_{iK} = {1 - {\sum\limits_{k = 1}^{K - 1}{b_{k}.}}}$

There are several ways to solve problem (7) because it is a standardquadratic program. We mention in particular the LSI algorithm in Lawsonand Hanson (1974).

Methods to Choose K

Two approaches to choose K in the latent ancestral class model are asfollows. In the first approach, we minimize the Akaike informationcriterion (AIC) which for unknown variance is defined as AIC=Nlog(ƒ(P))+2p* with N=n(n−1)/2, the number of observations, andp*=n(K−1), the number of parameters. In the second approach we performtwo steps: first set the number of ancestral classes equal to itsmaximum (the number of columns in Q); second determine how many columnsof the matrix P contain nonzero elements (representing the actual numberof latent ancestors that most accurately reproduce Q). An alternative tothe second step in the second approach is to use the matrix P tocalculate the effective number of the latent ancestors by using theequation

$n_{eff} = \frac{1}{\sum\limits_{k = 1}^{K}( {p_{+ k}/p_{++}} )^{2}}$

(HILL 1973) where+as index indicates the sum over the index; for examplep_(+k) is the column sum. Note that 1/n_(eff) is equal to Simpson'sindex (HILL 1973). The Simpson index can be interpreted in our contextas the probability that two randomly chosen individuals inherited fromthe same ancestor. Notice that the number of latent ancestors and theeffective number of latent ancestors can also be usefully defined forthe i^(th) individual by the number of nonzero elements in the vectorp_(i) and by

${n_{{eff},i} = \frac{1}{\sum\limits_{k = 1}^{K}p_{ik}^{2}}},$

respectively. Notice also that each column sum of P contains theexpected number of the individuals that inherit from the correspondinglatent ancestor.

Usage of the Latent Ancestral Class Model in QTL Analysis

In QTL analysis using (mixed model) regression analysis with geneticpredictors (Malosetti et al., 2006) the matrix P can directly beinserted as a, possibly additional, set of K genetic predictors. The useof matrix P in this framework can be more efficient in statistical andcomputational aspects than the use of matrix Q in a variance componentsimplementation.

QTL analysis using Bayesian methods (Bink et al., 2008) can in theoryutilize the full pedigree and marker data. In such an approach, QTLalleles are a priori independently assigned to the actual founders ofthe pedigree where the number of alleles is two or more (Bink et al.,2008). However, the numerical implementation by a Markov Chain MonteCarlo algorithm (Bink et al., 2008) is impractical, as it requires morecomputing resources and time than is available in practice. One solutionis to delete a large part of the pedigree and a priori independentlyassign alleles to the founders of the remaining part of the pedigree,which we call parents from now on. However, this solution ignores theleft-out part of the pedigree and neglects the fact that alleles ofdifferent parents are not necessarily independent. An alternative is toreplace the pedigree and marker information at each locus on the genomeby an identity by descent matrix (Q) among the parents. With n parents,each Q matrix is of size n by n. However, identity by descent matricescannot directly be used in oligogenic Bayesian methods for QTL analysiswhereas, possibly probabilistic, pedigree information can be used. Alatent ancestral class model provides probabilistic information that canbe used. The fact that actual founders are replaced by latent(hypothesized) ancestors does not hamper this usage. Note that thelatent ancestral class model is constructed to contain as much of thepedigree and marker information as possible. In the Bayesian analysisthe assumption that the alleles of the parents are independent (whichdiscards the left-out pedigree information) is replaced by theassumption that the alleles are inherited according to the latentancestral class model as follows: the cut-down pedigree (in which theparents become founders) is extended implicitly through the use of the Pmatrices towards the K ancestors. The alleles of these ancestors are apriori independent and the parents inherit their alleles from theseancestors with probabilities specified in the matrix P. This completesthe Bayesian model description.

One example of a numerical implementation of the modified Markov ChainMonte Carlo algorithm for this model is as follows. The algorithmproceeds by Gibbs steps whereas each Gibbs step contains a proposal anda Metropolis-Hastings accept-reject step. The proposal consists of botha new QTL location and a new set of alleles for latent ancestors andparents. The new QTL location is drawn from a symmetric distributioncentered at the current QTL location. The alleles for the latentancestors are drawn independently from the prior distribution. The newset of alleles for the parents is drawn by first sampling from theappropriate P matrix from which ancestor each parent inherited and thenassigning, for each parent, the allele (of the ancestor) that itinherited. As this proposal samples from the prior distribution forancestral and parental alleles and is symmetric in the old and new QTLposition, the Metropolis-Hastings acceptance ratio is simply a ratio ofthe likelihoods of the current and proposed values.

There is an additional detail that needs description. We may not haveaccess to the matrix P for the proposed QTL location, but only to twoflanking positions. Let P_(l) and P_(r) be the P-matrices pertaining tothe flanking positions λ₁ and λ_(r). Then, the P-matrix at the QTLposition λ

$\begin{matrix}{P_{\lambda} = \{ \begin{matrix}{P_{l}\mspace{14mu} {with}\mspace{14mu} {probability}\mspace{14mu} {( {\lambda - \lambda_{i}} )/( {\lambda_{r} - \lambda_{l}} )}} \\{P_{r}\mspace{14mu} {with}\mspace{14mu} {probability}\mspace{14mu} {( {\lambda_{r} - \lambda} )/( {\lambda_{r} - \lambda_{l}} )}}\end{matrix} } & (11)\end{matrix}$

Sampling from P_(l) and P_(r) with these probabilities yields preciselythe average IBD probabilitiesQ_(λ)=((λ−λ_(l))Q_(l)+(λ_(r)−λ)Q_(r))/(λ_(r)−λ_(l)) if P_(l) and P_(r)perfectly fit Q_(l) and Q_(r), respectively. In the case of a perfectfit, the sampling approach of equation (11) is therefore equivalent tothat of calculating the P matrix corresponding to Q_(λ). In the case ofa non-perfect fit, the two approaches are not quite equivalent butclose, especially when the interval between the flanking positions isnarrow, as is typically the case for the parents. Note that the samplingapproach is not the same as sampling from a weighted average of P_(l)and P_(r). Also, the average IBD probabilities Q_(λ) can be obtained viasampling of the two neighboring IBD matrices Q_(l) and Q_(r), similar to(11), i.e.,

$Q_{\lambda} = \{ \begin{matrix}{Q_{l}\mspace{14mu} {with}\mspace{14mu} {probability}\mspace{11mu} {( {\lambda - \lambda_{l}} )/( {\lambda_{r} - \lambda_{l}} )}} \\{Q_{r}\mspace{14mu} {with}\mspace{14mu} {probability}\mspace{14mu} {( {\lambda_{r} - \lambda} )/{( {\lambda_{r} - \lambda_{l}} ).}}}\end{matrix} $

Practical Advantage of the Disclosed Methods

In practice, the matrices used will incorporate information that hasbeen compiled regarding the pedigree of potential parental lines foreach biparental cross. While more pedigree information is preferred, theabove-described method is still beneficial if only a portion of therelevant pedigree information is known, or if only a portion of therelevant pedigree information is used for ease of analysis. One of theadvantages of the disclosed method is a reduction in the need to discardportions of relevant pedigree data in the interest of processingresources. In effect, the disclosed method allows the available pedigreedata to be summarized into an overall probability of descent of adesirable phenotype. This permits a much more efficient analysis whilestill using as much available data as possible, thereby increasing thepredictive accuracy, as all markers that show a predictive effect arepart of the analysis, even if not otherwise statistically significantfor traditional QTL analysis.

The disclosed method can also be used to calculate the parentaldissimilarity at a locus in the form of the actual or the effectivenumber of the latent ancestral classes. It is impossible to detect a QTLat loci for which the effective number is close to 1. The uncertaintyabout the inheritance of a particular individual in the set ofindividuals under consideration is provided by 1−1/n_(eff,i).

The present invention is further illustrated by the following examples.The foregoing and following description of the present invention and thevarious embodiments are not intended to be limiting of the invention butrather are illustrative thereof. Hence, it will be understood that theinvention is not limited to the specific details of these examples.

Example 1 Description of Classification Method

The identity-by-descent matrix Q is a square matrix containing ameasurement of the similarity between two individuals. The similaritymeasurement may be the probability that any two individuals share thesame allele of a common ancestor for a particular gene, marker,chromosomal segment, or haplotype. Equivalently, the Q matrix could havevalues that are an average similarity measure (possibly weighted) acrossspecific genomic segments or the whole genome. The values within the Qmatrix are determined by collecting pedigree and genotype information onindividuals in a population. The number of ancestral alleles is unknown.An example of such a matrix is shown in Table 2 for 8 individualslabeled A-H. A latent ancestor model with K disjoint latent classes inwhich each individual belongs to precisely one latent class is anexample method for assigning the individuals in Table 2 into groups. Thecrisp class memberships in latent class models can also be extended toprobabilistic measures. Let P be a matrix with elements p_(ik) being theprobability that individual i belongs to class k for i=1 . . . nindividuals and k=1 . . . K classes (see Eq. (1)).

Finding the matrix P that solves Q≈PP^(T) where P has non-negativeelements is one way to classify Q. Solving this equation by one of thealgorithms outlined above yields the matrix in Table 3 with K=5. Thecolumns of P in Table 3 are arranged in order of decreasing column sums.Note that the diagonal elements in PP^(T) are ignorable for theminimization problem. The approximation to Q is very good (RMSE=0.0001).

We now provide examples of the interpretation of statistics derived fromthe matrix P. In Table 3, four individuals can inherit from ancestor F1but the expected number of individuals that inherit from class F1 (thecolumn sum for F1) is only 2.98, because of the low probability forindividual B and the less than 1 probability for individual E. Theexpected number of individuals that inherit from the last class, F5, isonly 0.006. As a result, the measures for parental dissimilarity in thisexample, namely the actual and effective number of latent ancestralclasses, are 5 and 3.08, respectively. For individual H, the actual andeffective number of latent ancestral classes are 2 and 1.47,respectively. The uncertainty about the inheritance of this particularindividual in the set of individuals under consideration is1−1/n_(eff,i)=0.32, which the maximum of the inheritance uncertaintiesfor the individuals in Table 3. The inheritance uncertainty is zero forindividuals A, C, D and F.

Another method for classifying individuals can occur via a thresholdmethod that yields a binary model for Q. Transform the matrix Q into adiscrete matrix S by applying the rule that s_(ij)=0 if q_(ij)<t andotherwise s_(ij)=1. The variable t is the threshold and s_(u) is theidentity-by-descent status for individuals i and j. Using t=0.6 to the Qmatrix in Table 2 yields a binary P matrix with three groups (F1, F2,F3) of individuals as shown in Table 4. Many other possible methodsdocumented in the literature (e.g., Shepard and Arabie (1979), Sato andSato (1994)), could be used for assigning individuals from Table 2 intogroups but we do not describe those methods in detail here.

TABLE 2 Identity by descent matrix for 8 individuals labeled A-H wherenumber of ancestral alleles is unknown. Individual A B C D E F G H A 10.032 1 0 0.950 0 0 0 B 0.032 1 0.032 0.892 0.034 0.069 0.056 0.234 C 10.032 1 0 0.950 0 0 0 D 0 0.892 0 1 0 0 0 0.200 E 0.950 0.034 0.950 0 10.050 0.040 0.040 F 0 0.069 0 0 0.050 1 0.810 0.800 G 0 0.056 0 0 0.0400.810 1 0.648 H 0 0.234 0 0.200 0.040 0.800 0.648 1

TABLE 3 Classification of individuals in Table 2 using a latent ancestormodel. Individual F1 F2 F3 F4 F5 A 1 0 0 0 0 B 0.032 0.069 0.892 0.0010.006 C 1 0 0 0 0 D 0 0 1 0 0 E 0.950 0.050 0 0 0 F 0 1 0 0 0 G 0 0.8100 0.190 0 H 0 0.800 0.200 0 0

TABLE 4 Classification of individuals using a threshold method.Individual F1 F2 F3 A 1 0 0 B 0 0 1 C 1 0 0 D 0 0 1 E 1 0 0 F 0 1 0 G 01 0 H 0 1 0

Example 2 Application to Selection of Populations

The proposed invention may be used to select individuals for use in abreeding program as the parents of new populations.

Table 5 considers a set of existing information on the relative breedingvalue of two maize chromosomal segments. The first segment can be tracedto three different ancestral individuals, labeled white, grey, or black.These ancestral founders of the current breeding population have known apriori qualitative breeding values of good, average, and poor,respectively, for a given trait. At another genetic locus the same threefounder individuals are present but with a different set of breedingvalues. Although this example uses only three founder individuals itshould be obvious that the method can be extended to arbitrary numbersof founders and traits and that other breeding value ranking systemscould be used, such as quantitative metrics rather than the stylizedqualitative metric shown in Table 5. Additionally, and without loss ofgenerality, we will assume that the two loci shown in Table 5 combinetogether additively.

FIG. 1 shows the position of the two loci as being located on the samechromosome. Additionally, pedigree and genotypic information collectedon the founders and 8 individuals (labeled A-H) has been used togenerate identity by descent (IBD) data. Many standard techniques forcalculating IBD within a population are available in the literature(Meuwissen and Goddard (2007); Mao and Xu (2005); Besnier and Carlborg(2007); Wang et al. (1995); Heath (1997); Pong-Wong et al. (2001)) andcustomized methods can be built that extend these methods to particularreference populations. FIG. 1 uses the colors white, grey, and black toshow which of the three founders these 8 individuals carry on thischromosome.

A Q matrix for the allelic structure of the individuals at these twoloci can be constructed. In turn, this information can be used togenerate a second matrix indicating the founders of the 8 individuals atthe two loci of interest and thus to classify the individuals intogroups. For a simple pedigree, in which the founders are also the mostrecent common ancestors of the individuals, this second matrix is likelyto be similar (or even identical) to the 8×3 matrix containing theidentity by descent probabilities between each of the individuals andeach of the founders. It is thus possible to directly identify thelatent ancestors with the known founders of Table 5. Table 6 shows afurther classification of the 8 individuals into 5 groups depending onthe founders present at locus 1 and at locus 2.

The classification of these 8 individuals into groups, coupled with ourknowledge of the breeding value of locus 1 and locus 2, enables theselection of individuals for use in further crosses in the breedingprogram (Table 6, last column). In this example we show that individualsA, B, C, D, and E were selected for use as parents of new populationssince these individuals do not carry the poor allele at either locus.New populations generated from these individuals will be biased towardproducing favorable progeny.

This example can be generalized to the case where the founders areunknown and the IBD information is used with the disclosed method tomake inferences on the founders. In this case the founders are latent(unknown). The breeding value of the founders, such as in Table 5, canstill be determined as in the case that the 8 individuals are a subsetof a larger set for which some individuals are phenotyped. This showsthe generality of the selection method presented here.

TABLE 5 Breeding values for various founding individuals in ahypothetical population. Breeding Value Locus 1 Founder White Good GreyAverage Black Poor Locus 2 Founder White Average Grey Good Black Poor

TABLE 6 Classification of 8 individuals into into 5 groups depending onfounders present at locus 1 and 2. Locus 1 Locus 2 Individual IBD IBDGroup Select A White White 1 Yes B Grey Grey 2 Yes C Grey White 3 Yes DGrey Grey 2 Yes E Grey Grey 2 Yes F Black White 4 No G Black Black 5 NoH Black White 4 No

Example 3 Application to a Biparental Population

The proposed invention may be used to select individuals for use in abreeding program. In this example we detail one application of thisapproach for selecting individuals from a single biparental crossbetween two inbred parents.

Identity-by-descent (IBD) information is generated on a population basedon pedigree and marker data. Many standard techniques for calculatingIBD within a population are available in the literature (Meuwissen andGoddard (2007); Mao and Xu (2005); Besnier and Carlborg (2007); Wang etal. (1995); Heath (1997); Pong-Wong et al. (2001)) and customizedmethods can be built that extend these methods to particular referencepopulations. All the available pedigree and genotype information is usedto construct the IBD profile for individuals in the population. In thisexample we show that the parents of the biparental cross have a denseset of marker information available whereas the progeny are genotyped atfewer marker positions. This is a typical situation in commercial plantbreeding programs and a schematic representation is shown in FIG. 2.Additionally, the presence of a dense marker set for the parents allowsmultipoint techniques to be applied. Combining such techniques withaccurate pedigree records further improves the IBD calculation throughthe imputation of missing genotypes and the identification andcorrection of genotyping and/or pedigree errors.

The IBD information allows us to classify the progeny into groups;multiple methods for performing this classification were described inExample 1. In some parts of the genome the two parents of the progenyhave similar IBD information (the black bands in FIG. 2). In theseregions all individuals are assigned to a single group as the effectivenumber of latent ancestors is close to 1. That is, all individuals havecompletely similar genetics for these loci. In regions where the parentsdiffer in their IBD profiles (the white bands in FIG. 2) the alleles ofthe progeny can be classified into multiple groups based on matingdesign. For a set of doubled haploid individuals, for example, there aretwo possible groups for the progeny: individuals may have either thealleles of the first parent or those of the second parent.

The classification of the progeny into groups using highly accurate IBDinformation is then used to augment a statistical analysis of theprogeny. In this example we consider mapping of quantitative trait loci(QTL) as the statistical analysis technique. The classification of theprogeny allows us to consider a smaller number of marker positions,termed the effective progeny marker map, rather than the full set ofmarker loci at which the progeny have been genotyped. The effectiveprogeny marker map is created by not analyzing markers in regions of thegenome where no marker polymorphisms can occur. This reduces thecomputational complexity and potentially speeds up the QTL analysis.

FIG. 3 shows the results of such a QTL analysis on grain yield datacollected on 976 F5 maize topeross progenies at a single location in theUS corn belt. The data used for this analysis have been extensivelydescribed and published on elsewhere (Boer et al. (2007); Openshaw andFrascaroli (1997)). The QTL analysis in this case was conducted using a0.5 Bayesian Markov Chain Monte Carlo method (Bink et al. (2002)). Inthe top panel we see the QTL mapping results along one of ten maizechromosomes that are generated by considering the parents to beunrelated (UNR). Clear QTL signal is identified on this chromosome forgrain yield, particularly in two peaks on the top arm of the chromosome.In the bottom panel we have augmented the QTL analysis by consideringthe relatedness of the two parents to classify the progeny genotypesinto groups (IBDF). The genetic similarity between the two parents isshown between the two QTL profiles and clearly identifies regions wherethe two parents have identical IBD profiles (black bands). A preferredway to quantitatively define the genetic similarity here is 1/n_(eff)with n_(eff) calculated from the second matrix P. For the QTL peak nearthe middle of the chromosome the inclusion of the IBD information on theparents has allowed the analysis to exclude certain genomic regions aspotentially containing the QTL. In this way the QTL profile is muchsharper and specific when IBD information is included and provides abetter determination of QTL locations. We note that in this instance theoverall genetic signal detected in the IBDF approach (i.e. the estimatedtrait heritability) is similar to that identified when the parents areassumed to be unrelated (Table 7), both for grain moisture at harvestand grain yield.

TABLE 7 Genetic signal detected by IBDF approach and when parents areassumed to be related. H² UNR IBDF GRAIN YIELD 0.10 0.16 GRAIN MOISTURE0.38 0.35

The identified QTL peaks can then be used to select individuals from thebiparental cross for future use in the breeding program. In FIG. 4 weshow a histogram of the estimated genetic value for all 976 progenyobtained by summing QTL effects at each of the QTL positions identifiedon all ten chromosomes. The genetic values of the two inbred parents,labeled A and B in the figure, are shown for reference. The selection ofindividuals that outperform the yield of the top performing parent isindicated by the dashed vertical line. Other methods for selectingindividuals, both on a single or multiple trait basis, may also be used.

Example 4 Application to a Typical Testcross Population in a BreedingProgram

The proposed invention may be used to select individuals for use in abreeding program. In this example we detail one application of thisapproach for selecting individuals from multiple crosses between relatedinbred parents.

Populations comprised of multiple connected crosses between inbredparents are typical in plant breeding programs (e.g., Blanc et al.(2007)), particularly for the development of new inbred lines.Additionally, the detection of quantitative trait loci responsible forcomplex trait variation within such populations is of key interest tothe scientific community (e.g., U.S. Pat. No. 6,399,855).

The population for this example contained five inbred parents used infour connected crosses of 681 total double haploid maize toperossprogeny grown at 12 test locations throughout the United States. Themating design was a half-sib structure around one of the five inbredparents. Phenotypic data were collected on numerous traits usingstandard practices although only a trait related to maturity date wasanalyzed for this example.

As described previously, pedigree and genotype information can becollected on the set of parents which have been crossed. Thisinformation is used to calculate identity-by-descent (IBD) informationon the parents and progeny in the population and in turn used by themethods described in previous examples to classify the progeny intogroups (see Example 1 for descriptions of these methods). The resultinginformation was then used to augment a statistical analysis of thepopulation using Bayesian Markov Chain Monte Carlo methods for mappingquantitative trait loci (QTL; Bink et al., 2002, 2008).

FIG. 5 shows a QTL intensity profile from the analysis of these data forthe case where the parents are considered to be unrelated (UNR, toppanel) and when IBD information is used to augment the statisticalanalysis (IBDF, bottom panel). Clearly both analyses identify geneticsignal in similar positions along this particular chromosome. The UNRcase, however, presents a wider QTL credible interval (Gelman (2004))around the peak position. In the UNR case the QTL credible interval wasroughly 32 centiMorgans whereas when using IBD information in the mannerdescribed by the invention the QTL credible interval is reduced to aregion of 7 centiMorgans. The IBDF approach has provided a more preciselocalization of the QTL that is corroborated by extensive outsideevidence, wherein the specific gene associated with the time offlowering in maize, vgt1, has been positionally cloned to a 2 kilobaseregion on maize chromosome 8 within this interval (Salvi et al. (2007)).

Tables 8-10 help to explain why the IBDF approach was able to locatethis QTL more precisely than the UNR case. These tables show a sharplydefined region from 118-123 centiMorgans in which the recurrent parentin the half-sib mating design is clearly different from the other fourparents. Also, the other four parents are all highly related within thisregion. Analysis of the phenotypic and marker data, when combined withthis information, makes it possible to observe a QTL contrast betweenparent A and parents B-E within this region. At positions 117 cM and 124cM, however, parent A does have some relationship to parents B-E, whichmakes it unlikely that the QTL would reside at those positions. Becauseof these relationships the QTL analysis in the IBDF case positions theQTL always in the region from 117 cM to 124 cM and thus gives a narrowerQTL credible interval. The effective number of ancestral classes is veryclose to 1 (1.01) for Table 8 whereas this statistic equals 1.47 and1.31 for Tables 9 and 10, respectively. The P matrices from these Qmatrices clearly show that two ancestral classes are highly probablebetween 118 and 123 cM as well. A chief advantage of using computermethods for determining P is obvious to those skilled in the art, inthat human inspection of matrices like Tables 8-10 is impractical forthe large volume of marker data available on individuals of interest ina breeding program. Moreover, it should be clear that our method makesthe assignment of individuals to groups straightforward for Q matriceswith off-diagonal elements intermediate between 0 and 1.

TABLE 8 Q matrix at 117 cM. Individual A B C D E A 1 0.98 0.98 0.98 0.98B 0.98 1 1 1 1 C 0.98 1 1 1 1 D 0.98 1 1 1 1 E 0.98 1 1 1 1

TABLE 9 Q matrix averaged across 118-123 cM. Individual A B C D E A 10.01 0.01 0.01 0.01 B 0.01 1 1 1 1 C 0.01 1 1 1 1 D 0.01 1 1 1 1 E 0.011 1 1 1

TABLE 10 Q matrix at 124 cM. Individual A B C D E A 1 0.31 0.31 0.310.31 B 0.31 1 1 1 1 C 0.31 1 1 1 1 D 0.31 1 1 1 1 E 0.31 1 1 1 1

TABLE 11 P matrix for Q matrix in Table 8. Individual F1 F2 A 0.98 0.02B 1 0 C 1 0 D 1 0 E 1 0

TABLE 12 P matrix for Q matrix in Table 9. Individual F1 F2 A 0.01 0.99B 1 0 C 1 0 D 1 0 E 1 0

TABLE 13 P matrix for Q matrix in Table 10. Individual F1 F2 A 0.31 0.69B 1 0 C 1 0 D 1 0 E 1 0

Example 5 Application to a Typical Set of Hybrids in a Breeding Program

The proposed invention may be used to select individuals for use in abreeding program. In this example we detail one application of thisapproach for selecting F1 hybrid progeny.

Populations of F1 hybrid progeny are typical of breeding programs asthese represent potential products for commercial release. In thisexample we evaluated a population of 385 F1 hybrids grown in manylocations throughout the United States. Phenotypic data were collectedon numerous traits using standard practices although only a traitrelated to maturity date was analyzed for this example. Each F1 hybridis produced by crossing two inbred parents together.

Pedigree and genotype information was collected on the set of inbredparents that created the hybrid population. In the current example thismeant recording up to 18 generations of pedigree relationships andgenotypic information at 768 SNP markers throughout the genome. Thisinformation was used to calculate identity-by-descent information on theparents and F1 progeny as described in Example 3 and the thresholdmethod described in Example 1 was used to classify the progeny intogroups. The resulting matrix was then used to augment a statisticalanalysis of the F1 progeny using Bayesian Markov Chain Monte Carlomethods (Bink et al. (2002, 2008)).

A whole genome selection technique was chosen as the statisticaltechnique for evaluating the F1 hybrid progeny (e.g. Meuwissen et al.(2001)). FIG. 6 shows a comparison of the predicted maturity date totheir observed maturity date for all 385 hybrids when using traditionalpedigree-only approaches (PED, left panel) and when augmenting thestatistical analysis (IBDF, right panel). Both analysis methods hadcorrelation coefficients between 0.5 and 0.6 when comparing predictionsto observed data for this example. In the IBDF approach, however, the 18ancestral generations are not included in the analysis because all ofthose relationships are captured through the IBD matrices that we haveused to group the population. This reduces the computational complexityof the analysis and potentially reduces its run-time.

In addition the method can be used to choose the parents of new F1hybrid individuals for the breeding program. The IBDF approach was usedto predict genetic values for new, untested hybrid combinations based onthe augmented statistical analysis described above. FIG. 7 shows thecorrelation of the predicted values for the new, selected F1 hybridscompared to their observed measurements in a new experiment. In thisexample the IBDF approach had a linear correlation coefficient betweenthe predicted and observed values for maturity date of 0.71.

While the foregoing invention has been described in some detail forpurposes of clarity and understanding, it will be clear to one skilledin the art from a reading of this disclosure that various changes inform and detail can be made without departing from the true scope of theinvention. For example, all the techniques described above can be usedin various combinations.

The following references cited herein are hereby expressly incorporatedby reference in their entirety:

-   Besnier F, Carlborg O. A general and efficient method for estimating    continuous IBD functions for use in genome scans for QTL. BMC    Bioinformatics 2007; 8:440. doi: 10.1186/1471-2105-8-440.-   Bink, M. C. A. M., Uimari, P., Sillanpaa, M. J., Janss, L. L. G.;    Jansen, R. C., 2002. Multiple QTL mapping in related plant    populations via a pedigree-analysis approach. Theoretical and    applied genetics 104, 751-762.-   Bink, M. C. A. M., M. P. Boer, C. J. F. ter Braak, J. Jansen, R. E.    Voorrips & W. E. van de Weg, 2008. Bayesian analysis of complex    traits in pedigreed plant populations. Euphytica 161: 85-96.-   Blanc G, Charcosset A, Veyrieras J B, Gallais A, Moreau L.    Marker-assisted selection efficiency in multiple connected    populations: a simulation study based on the results of a QTL    detection experiment in maize. Euphytica 2008; 161: 71-84.-   Boer M, Wright D, Feng L, Podlich D, Luo L, Cooper M, Van Eeuwijk F.    A mixed model quantitative trait loci (QTL) analysis for multiple    environment trial data using environmental covariables for    QTL-by-environment interactions, with an example in maize. Genetics    2007; 177:1801-1813. doi:10.1534/genetics.107.071068.-   Gelman, Andrew et. al. Bayesian Data Analysis (Chapman & Hall 2004).-   Hill, M. O., 1973 Diversity and Evenness—Unifying Notation and Its    Consequences. Ecology 54: 427-432.-   Heath S C. Markov Chain Monte Carlo Segregation and Linkage Analysis    for Oligogenic Models. Am J Hum Genet. 1997; 61:748-760.-   Lawson, C. L. and Hanson, R. J., 1974. Solving least squares    problems. Prentice-Hall, Englewood Cliffs, N.J.-   Lawson, C. L. and Hanson, R. J., 1995. Solving Least Squares    Problems. Classics in Applied Mathematics. SIAM, Philadelphia.-   Mao Y, Xu S. A Monte Carlo algorithm for computing the IBD matrices    using incomplete marker information. Heredity 2005; 94:305-315.    doi:10.1038/sj.hdy.6800564.-   Malosetti, M, Visser, R G F, Celis-Gamboa, C, and van Eeuwijk, F A.    Qtl methodology for response curves on the basis of non-linear mixed    models, with an illustratrion to senescence in potato. Theor. Appl.    Genet. 2006; 113, 288-300. doi:10.1007/s00122-006-0294-2.-   Meuwissen T H E, Goddard M. Multipoint identity-by-descent    prediction using dense markers to map quantitative trait loci and    estimate effective population size. Genetics 2007; 176:2551-2560.    doi:10.1534/genetics.107.070953.-   Meuwissen T H E, Hayes B J, Goddard M E. Prediction of total genetic    value using genome-wide dense marker maps. Genetics 2001; 157:    1819-1829.-   Rosset, S. and Zhu, J., 2007. Piecewise linear regularized solution    paths Ann. Statist., 35, 1012-1030.-   Openshaw, S., and E. Frascaroli, 1997 QTL detection and marker    assisted selection for complex traits in maize. 52nd Annual Corn and    Sorghum Industry Research Conference. ASTA, Washington, D.C., pp.    44-53.-   Pong-Wong, R., A. W. George, J. A. Woolliams & C. S. Haley, 2001. A    simple and rapid method for calculating identity-by-descent matrices    using multiple markers. Genetics Selection Evolution 33: 453-471.-   Salvi S, Sponza G, Morgante M, Tomes D, Niu X, Fengler K, Meeley R,    Ananiev E, Svitashev S, Bruggemann E, Li B, Hainey C, Radovic S,    Zaina G, Rafalski J A, Tingey S, Miao G H, Phillips R, Tuberosa R.    Conserved noncoding genomic sequences associated with a    flowering-time quantitative trait locus in maize. PNAS 2007;    104:11376-11381.

Sato M, Sato Y. An additive fuzzy clustering model. 1994, JapaneseJournal of Fuzzy Theory and Systems, 6, 185-204.

-   Shepard R N, Arabie P. Additive clustering: representation of    similarities as combinations of discrete overlapping properties.    1979, Pyschological Review, 86, 87-123.-   Wang T, Fernando R L, Van der Beek S, Grossman M, Van Arendonk J    A M. Covarience between relatives for a marked quantitative trait    locus. Genet Sel evol. 1995; 27:251-274. doi: 10.1051/gse:19950304.

What is claimed is:
 1. A method for producing a selection matrix usefulfor selecting from a population a subset of individuals that have anincreased probability of producing progeny having one or morephenotypes, the method comprising: a. collecting pedigree and genotypeinformation pertaining to one or more loci in one or more individual ina population; b. building a first matrix by calculating the probabilitythat alleles present at one or more loci in the one or more individualsare identical by descent in the population, based on the pedigree andgenotype information; and c. using the first matrix to calculate aselection matrix that classifies one or more individuals from thepopulation into groups, wherein the selection matrix is calculated bygrouping individuals that exceed a threshold on the entries of the firstmatrix in a manner that enforces transitivity of group membership;wherein at least one step of the method is implemented on a computer. 2.The method of claim 1, wherein the first matrix is calculated using oneor more of the group consisting of stochastic algorithms, deterministicalgorithms, maximum likelihood algorithms, and descent graph methods. 3.The method of claim 1, wherein the selection matrix is calculated usinga latent class model, a non-negative matrix factorization model, or abilinear model.
 4. The method of claim 1, wherein the populationcomprises members of a plant species.
 5. The method of claim 4, whereinthe plant species is selected from the group consisting of maize, wheat,rice, millet, barley, sorghum, rye, soybean, alfalfa, oilseed Brassica,cotton, sunflower, potato, and tomato.
 6. The method of claim 1, whereinthe one or more phenotypes are selected from the group consisting ofyield, leaf angle, anthesis-silking interval (ASI), staygreen duration,early growth rate, overall growth rate, growth pattern, maximum biomass,total biomass, nitrogen use efficiency, water use efficiency, tocolcontent, oleic acid content, phytic acid content, amino acidcomposition, oil quality or quantity, energy availability,digestibility, fatty acid composition, a pathogen defense mechanism,lysine and sulfur levels, starch synthesis, disease resistance,herbicide resistance, male sterility, plant vigor, nutrient content,hemicellulose content, cellulose production, cold tolerance, salttolerance, heat tolerance, drought tolerance, grain moisture content,stalk lodging, root lodging, root pulling resistance, standestablishment, emergence, midsilk, test weight, protein content, starchpercentage, relative maturity, plant height, seed size, diseaseresistance genes, heading date, resistance to insects, brittle snap,stalk breakage, resistance to fungus, seed moisture, head shape,hullability, seedling vigor, beginning bloom date, maturity date, seedshatter, winter survival, fiber strength, ear height, plant barrenness,seed number, seed weight, and color grade.
 7. The method of claim 6,wherein the one or more phenotypes include drought tolerance.
 8. Themethod of claim 1, wherein the population is a doubled haploidpopulation.
 9. The method of claim 1, wherein the population is from abiparental cross.
 10. The method of claim 1, wherein the population is abackcross population.
 11. The method of claim 8, wherein the one or moretraits comprise an improvement in one or more phenotypes selected fromthe group consisting of yield, leaf angle, anthesis-silking interval(ASI), staygreen duration, early growth rate, overall growth rate,growth pattern, maximum biomass, total biomass, nitrogen use efficiency,water use efficiency, tocol content, oleic acid content, phytic acidcontent, amino acid composition, oil quality or quantity, energyavailability, digestibility, fatty acid composition, a pathogen defensemechanism, lysine and sulfur levels, starch synthesis, diseaseresistance, herbicide resistance, male sterility, plant vigor, nutrientcontent, hemicellulose content, cellulose production, cold tolerance,salt tolerance, heat tolerance, drought tolerance, grain moisturecontent, stalk lodging, root lodging, root pulling resistance, standestablishment, emergence, midsilk, test weight, protein content, starchpercentage, relative maturity, plant height, seed size, diseaseresistance genes, heading date, resistance to insects, brittle snap,stalk breakage, resistance to fungus, seed moisture, head shape,hullability, seedling vigor, beginning bloom date, maturity date, seedshatter, winter survival, fiber strength, ear height, plant barrenness,seed number, seed weight, and color grade.
 12. The method of claim 11,wherein the one or more phenotypes include drought tolerance.
 13. Amethod of predicting a polymorphic genomic region for a populationderived from a cross, the method comprising: a. collecting pedigree andgenotype information pertaining to one or more loci in at least oneindividual in a population; b. building a first matrix by calculatingthe probability that alleles present at one or more loci in the one ormore individuals are identical by descent in the population, based onthe pedigree and genotype information; c. using the first matrix toproduce a second matrix that classifies one or more individuals from thepopulation; d. using the second matrix to select a subset of theindividuals from the population that have an increased probability ofhaving at least one polymorphic genomic region; and, e. breeding theselected individuals; wherein at least one step of the method isimplemented on a computer.
 14. The method of claim 13, wherein thepolymorphic genomic region is associated with a trait of interest. 15.The method of claim 13, wherein the population comprises members of aplant species.
 16. The method of claim 15, wherein the plant species isselected from the group consisting of maize, wheat, rice, millet,barley, sorghum, rye, soybean, alfalfa, oilseed Brassica, cotton,sunflower, potato, and tomato.
 17. The method of claim 14, wherein thetrait of interest is selected from the group consisting of yield, leafangle, anthesis-silking interval (ASI), staygreen duration, early growthrate, overall growth rate, growth pattern, maximum biomass, totalbiomass, nitrogen use efficiency, water use efficiency, tocol content,oleic acid content, phytic acid content, amino acid composition, oilquality or quantity, energy availability, digestibility, fatty acidcomposition, a pathogen defense mechanism, lysine and sulfur levels,starch synthesis, disease resistance, herbicide resistance, malesterility, plant vigor, nutrient content, hemicellulose content,cellulose production, cold tolerance, salt tolerance, heat tolerance,drought tolerance, grain moisture content, stalk lodging, root lodging,root pulling resistance, stand establishment, emergence, midsilk, testweight, protein content, starch percentage, relative maturity, plantheight, seed size, disease resistance genes, heading date, resistance toinsects, brittle snap, stalk breakage, resistance to fungus, seedmoisture, head shape, hullability, seedling vigor, beginning bloom date,maturity date, seed shatter, winter survival, fiber strength, earheight, plant barrenness, seed number, seed weight, and color grade. 18.The method of claim 13, wherein the population is a doubled haploidpopulation.
 19. The method of claim 18, wherein the polymorphic genomicregion is associated with drought tolerance.